AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models
Authors
Abstract
The title of this paper playfully contrasts two rather different approaches to language analysis. The "Noisy Channel"s are the promoters of statistically based approaches to language learning; many of these studies are based on Shannon's Noisy Channel model. The "Braying Donkey"s are those oriented towards theoretically motivated language models. They are interested in any type of language expression (such as the famous "Donkey Sentences"), regardless of its frequency in real language, because their focus is the study of human communication. In the past few years, we have supported a more balanced approach. While our major concern is applicability to real NLP systems, we think that, after all, quantitative methods in Computational Linguistics should provide not only practical tools for language processing, but also some linguistic insight. Since, for the sake of space, we cannot give a complete account of our research in this paper, we present examples of "linguistically appealing", automatically acquired lexical data (selectional restrictions of words) obtained through an integrated use of knowledge-based and statistical techniques. We discuss the pros and cons of adding symbolic knowledge to the corpus-linguistics recipe.

1. The "Noisy Channel"s

All researchers in the field of Computational Linguistics, no matter what their specific interest may be, must have noticed the impetuous advance of the promoters of statistically based methods in linguistics. This is evident not only from the growing number of papers in many Computational Linguistics conferences and journals, but also from the many specific initiatives, such as workshops, special issues, and interest groups. A historical account of this "empirical renaissance" is provided in [Church and Mercer, 1993]. The general motivations are the availability of large on-line texts, on one side, and the emphasis on scalability and concrete deliverables, on the other.
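The Noisy Channel model invoked above can be sketched in a few lines: decoding picks the source string w that maximizes P(w) · P(o | w), where P(w) comes from a language model (e.g. n-grams) and P(o | w) from a channel model. The following is a toy illustration of that decision rule, not code or data from the paper; all probability values are invented.

```python
# Toy sketch of noisy-channel decoding (illustrative values only).
# Given an observed, possibly corrupted string o, choose the intended
# source string w that maximizes P(w) * P(o | w).

language_model = {          # P(w): prior over intended strings (invented)
    "the cat": 0.60,
    "the hat": 0.30,
    "thee at": 0.10,
}
channel_model = {           # P(o | w) for one fixed observation o (invented)
    "the cat": 0.70,
    "the hat": 0.05,
    "thee at": 0.25,
}

def decode(lm, channel):
    """Return argmax_w P(w) * P(o | w)."""
    return max(lm, key=lambda w: lm[w] * channel[w])

print(decode(language_model, channel_model))  # -> the cat
```

Here "the cat" wins because its prior and channel probabilities jointly dominate (0.60 × 0.70 = 0.42); an n-gram model would supply P(w) in a real system.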
We agree with the claim, supported by the authors, that statistical methods potentially outperform knowledge-based methods in terms of coverage and human cost. The human cost, however, is not zero. Most statistically based methods either rely on a more or less shallow level of linguistic preprocessing, or they need non-trivial human intervention for an initial estimate of the parameters (training). This applies in particular to statistical methods based on Shannon's Noisy Channel Model (n-gram models). As far as coverage is concerned, so far no method described in the literature has demonstrated adequate coverage of the linguistic phenomena being studied. For example, in collocational analysis, statistically reliable associations are obtained only for a small fragment of the corpus. The problem of "low counts" (i.e. linguistic patterns that were never, or only rarely, found) has not been analyzed appropriately in most papers, as convincingly demonstrated in [Dunning, 1993]. In addition, there are other performance figures, such as adequacy, accuracy, and "linguistic appeal" of the acquired knowledge for a given application, for which the supremacy of statistics is not entirely demonstrated. Our major objection to purely statistically based approaches is in fact that they treat language expressions like strings of signals. At its extreme, this perspective may lead to results that have practical interest, but give no contribution to the study of language.
Similar Papers
Tags Re-ranking Using Multi-level Features in Automatic Image Annotation
Automatic image annotation is a process in which computer systems automatically assign textual tags related to the visual content of a query image. In most cases, inappropriate tags generated by users, as well as images without any tags, are among the challenges in this field and have a negative effect on query results. In this paper, a new method is presented for automatic image...
Semantic Annotation and Lexico-Syntactic Paraphrase
The IAMTC project (Interlingual Annotation of Multilingual Translation Corpora) is developing an interlingual representation framework for annotation of parallel corpora (English paired with Arabic, French, Hindi, Japanese, Korean, and Spanish) with deep-semantic representations. In particular, we are investigating meaning equivalent paraphrases involving conversives and non-literal language us...
Visualisation of Long Distance Grammatical Collocation Patterns in Language
Research in generic unsupervised learning of language structure applied to the Search for ExtraTerrestrial Intelligence (SETI) and decipherment of unknown languages has sought to build up a generic picture of lexical and structural patterns characteristic of natural language. As part of this toolkit a generic system is required to facilitate the analysis of behavioural trends amongst selected p...
The Interface between Linguistic and Pragmatic Competence: The Case of Disagreement, Scolding, Requests, and Complaints
Second language learners often develop grammatical competence in the absence of concomitant pragmatic competence (Kasper & Roever, 2005), and the exact nature of the relationship between the two competences is still indistinct and in need of inquiry (Bardovi-Harlig, 1999; Khatib & Ahmadisafa, 2011). This study is a partial attempt to address the lacuna and aims to see if any relationship ca...
Fuzzy Neighbor Voting for Automatic Image Annotation
With the quick development of digital images and the availability of imaging tools, massive amounts of images are created. Therefore, efficient management and suitable retrieval, especially by computers, is one of the most challenging fields in image processing. Automatic image annotation (AIA) refers to attaching words, keywords or comments to an image or to a selected part of it. In this paper,...